Skip to content

ENH: pd.read_html argument to extract hrefs along with text from cells #45973

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Aug 16, 2022

Conversation

abmyii
Copy link
Contributor

@abmyii abmyii commented Feb 13, 2022

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please fix the pre-commit checks?

@abmyii abmyii force-pushed the read_html-extract-hrefs branch from 847d930 to d69ce74 Compare February 14, 2022 12:32
@abmyii
Copy link
Contributor Author

abmyii commented Feb 14, 2022

Could you please fix the pre-commit checks?

Ah I missed that error - sorry about that. I've fixed it now.

@abmyii abmyii force-pushed the read_html-extract-hrefs branch 2 times, most recently from 50aab0d to 70cf3fa Compare February 15, 2022 17:37
@abmyii abmyii force-pushed the read_html-extract-hrefs branch from 70cf3fa to a13c5f0 Compare February 15, 2022 21:59
@abmyii abmyii requested a review from phofl February 15, 2022 22:55
@attack68
Copy link
Contributor

This looks fine to me, albeit there is some agreement on the name of the new kwarg. Not convinced by extract_hrefs, but I can't right now suggest anything better.

@abmyii
Copy link
Contributor Author

abmyii commented Feb 16, 2022

This looks fine to me, albeit there is some agreement on the name of the new kwarg. Not convinced by extract_hrefs, but I can't right now suggest anything better.

I felt it was more explicit than extract_links - but perhaps that is a better name?

@abmyii abmyii requested a review from attack68 February 17, 2022 15:44
@abmyii abmyii requested a review from attack68 February 23, 2022 19:17
@jreback jreback added Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Feb 26, 2022
@@ -585,6 +624,10 @@ def _parse_tables(self, doc, match, attrs):
raise ValueError(f"No tables found matching pattern {repr(match.pattern)}")
return result

def _href_getter(self, obj):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type the args and returns of all of the added code

Copy link
Contributor Author

@abmyii abmyii Feb 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've typed the returns, but won't lxml/bs4 be required to type the args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@attack68 What shall I do about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Sorry to bother you, but I haven't been able to come up with a solution for this. Could you please suggest how I should do it?

To elaborate a bit on my first comment - the requirements may not be installed, and in that case the typing using the custom types defined in the libraries would fail (as far as I understand), so that doesn't seem like a viable solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke Would you be able to enlighten me regarding this request? I'm still at a loss as to how to approach it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the top of the file you can do:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from bf4/lxml import ...

Then type obj. The CI checks have all the optional dependencies installed so these checks should be available.

@abmyii abmyii requested a review from jreback February 28, 2022 17:39
Copy link
Contributor

@attack68 attack68 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine, except I think you need to test all the combinations "header" "footer" etc.

It would be good if you can put it all into one test and use. @pytest.mark.parametrize

Also have you rendered the docs. I think your version added was in the wrong place so it would be good to post a rendering to show the docs are working correctly.

@abmyii
Copy link
Contributor Author

abmyii commented Mar 18, 2022

@attack68 Could you help with this request from jreback? I'm not sure how to go about testing it.

need a test that validatees the extract_href arg is among the required values or raises.

@abmyii abmyii force-pushed the read_html-extract-hrefs branch from f418975 to 490005a Compare June 18, 2022 10:09
@abmyii
Copy link
Contributor Author

abmyii commented Jun 18, 2022

lgtm, if rebased and passing tests

Rebased and tests are passing. I've kept the extract_links index structure custom (returns a flat tuple index rather than a MultiIndex) for now, but I'm open to reverting that.

I'm not sure what is causing the Docstring action to fail - the line is totally different to what is shown in the error.

@abmyii abmyii requested a review from attack68 June 30, 2022 12:04
@attack68
Copy link
Contributor

attack68 commented Jul 2, 2022

this is failing checks.eg pandas/io/html.py line 1024. you cab review the logs for explanations.

@abmyii
Copy link
Contributor Author

abmyii commented Jul 2, 2022

this is failing checks.eg pandas/io/html.py line 1024. you cab review the logs for explanations.

https://github.com/abmyii/pandas/blob/read_html-extract-hrefs/pandas/io/html.py#L1024 is a blank line and 1025 is not a docstring?

Is https://github.com/abmyii/pandas/blob/read_html-extract-hrefs/pandas/io/html.py#L1141 the issue - and if so, should it be .. versionadded :: 1.5.0?

Copy link
Contributor

@attack68 attack68 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc edits

@abmyii abmyii requested a review from mroeschke July 30, 2022 11:06
@abmyii abmyii force-pushed the read_html-extract-hrefs branch from c6b7ea1 to 4c7f532 Compare August 1, 2022 18:14
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fairly good. Just one small comment and a merge conflict

@abmyii abmyii requested a review from mroeschke August 16, 2022 01:03
@mroeschke mroeschke added this to the 1.5 milestone Aug 16, 2022
@mroeschke mroeschke merged commit 9f81aa6 into pandas-dev:main Aug 16, 2022
@mroeschke
Copy link
Member

Thanks for sticking with this @abmyii!

@abmyii
Copy link
Contributor Author

abmyii commented Aug 16, 2022

Thanks for sticking with this @abmyii!

Awesome, thank you to all the reviewers, and you and @attack68 especially for helping me throughout this process!

YYYasin19 pushed a commit to YYYasin19/pandas that referenced this pull request Aug 23, 2022
pandas-dev#45973)

* ENH: pd.read_html argument to extract hrefs along with text from cells

* Fix typing error

* Simplify tests

* Fix still incorrect typing

* Summarise whatsnew entry and move detailed explanation into user guide

* More flexible link extraction

* Suggested changes

* extract_hrefs -> extract_links

* Move versionadded to correct place and improve docstring for extract_links (@attack68)

* Test for invalid extract_links value

* Test all extract_link options

* Fix for MultiIndex headers (also fixes tests)

* Test that text surrounding <a> tag is still captured

* Test for multiple <a> tags in cell

* Fix all tests, with both MultiIndex -> Index and np.nan -> None conversions resolved

* Add back EOF newline to test_html.py

* Correct user guide example

* Update pandas/io/html.py

* Update pandas/io/html.py

* Update pandas/io/html.py

* Simplify MultiIndex -> Index conversion

* Move unnecessary fixtures into test body

* Simplify statement

* Fix code checks

Co-authored-by: JHM Darbyshire <[email protected]>
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
pandas-dev#45973)

* ENH: pd.read_html argument to extract hrefs along with text from cells

* Fix typing error

* Simplify tests

* Fix still incorrect typing

* Summarise whatsnew entry and move detailed explanation into user guide

* More flexible link extraction

* Suggested changes

* extract_hrefs -> extract_links

* Move versionadded to correct place and improve docstring for extract_links (@attack68)

* Test for invalid extract_links value

* Test all extract_link options

* Fix for MultiIndex headers (also fixes tests)

* Test that text surrounding <a> tag is still captured

* Test for multiple <a> tags in cell

* Fix all tests, with both MultiIndex -> Index and np.nan -> None conversions resolved

* Add back EOF newline to test_html.py

* Correct user guide example

* Update pandas/io/html.py

* Update pandas/io/html.py

* Update pandas/io/html.py

* Simplify MultiIndex -> Index conversion

* Move unnecessary fixtures into test body

* Simplify statement

* Fix code checks

Co-authored-by: JHM Darbyshire <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] pandas.read_html argument to interpret hyperlinks as links (not merely text)
6 participants